gnurizen (Contributor) commented Apr 9, 2026
- Rewrite parcagpu to use Proton's CUPTI infrastructure
- Update proton callback API names for upstream sync
- Update proton submodule to latest upstream sync
- Replace interval-based rate limiter with token bucket algorithm
- Add activity_batch USDT probe and fix test infrastructure
- Various fixes to make arm64 work
- And make amd64 compile too
- Small cleanups/formatting
- Shorten names
- Checkpoint PC sampling tweaking
- Stall reason map handling, prepping for batched pc samples
- Flesh out cubin processing, sass lookup and pc sampling probe batching
- PC sampling: probabilistic windowed start/stop with KERNEL_SERIALIZED mode
- Cleanup related to usdt/cupti extraction
Major changes:
- Use Proton as a git submodule for CUPTI callback handling
- Rewrite in C++ using Proton's CuptiApi and callback patterns
- Add PC sampling support for continuous GPU profiling
- Simplify build to single library (works with any CUDA version at runtime)
- Use CMake build system
- Consolidate GitHub workflows into single build.yml
- Update Dockerfile to Ubuntu 24.04 (fixes USDT probe generation)
The library now uses Proton's dynamic CUPTI loading, so a single build works with CUDA 12.x and 13.x at runtime.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
setDriverCallbacks renamed to setLaunchCallbacks in upstream proton.
The simple 500μs interval check could only pass 2000 samples/sec regardless of actual load. The token bucket (configurable via PARCAGPU_RATE_LIMIT, default 100/sec) smooths bursts while maintaining a predictable average rate.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
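For readers unfamiliar with the technique, here is a minimal token bucket sketch. It is not the code from this PR: the class name `TokenBucket`, the `burst` parameter, and the helper `rateFromEnv` are hypothetical, and only the `PARCAGPU_RATE_LIMIT` variable and its 100/sec default come from the commit message. The clock is passed in explicitly so the behavior can be checked without sleeping.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdlib>

// Hypothetical sketch of a token bucket; not the parcagpu implementation.
class TokenBucket {
public:
  using Clock = std::chrono::steady_clock;

  // rate: tokens added per second; burst: maximum tokens held at once.
  TokenBucket(double rate, double burst, Clock::time_point now = Clock::now())
      : rate_(rate), burst_(burst), tokens_(burst), last_(now) {
    assert(rate > 0.0 && burst >= 1.0);
  }

  // Refills based on elapsed time, then consumes one token if available.
  // Returns true if a sample may be emitted at `now`.
  bool tryAcquire(Clock::time_point now = Clock::now()) {
    std::chrono::duration<double> elapsed = now - last_;
    if (elapsed.count() > 0.0) {
      tokens_ = std::min(burst_, tokens_ + elapsed.count() * rate_);
      last_ = now;
    }
    if (tokens_ < 1.0) return false;
    tokens_ -= 1.0;
    return true;
  }

private:
  double rate_;
  double burst_;
  double tokens_;
  Clock::time_point last_;
};

// Reads the rate from PARCAGPU_RATE_LIMIT, falling back to 100/sec when the
// variable is unset or does not parse as a positive number (assumed policy).
inline double rateFromEnv() {
  const char *s = std::getenv("PARCAGPU_RATE_LIMIT");
  if (!s) return 100.0;
  char *end = nullptr;
  double v = std::strtod(s, &end);
  return (end != s && v > 0.0) ? v : 100.0;
}
```

Unlike a fixed minimum interval between samples, the bucket lets a short burst through (up to `burst` tokens) and then throttles to the average rate, which is the smoothing behavior the commit message describes.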
Add parcagpuActivityBatch() probe that fires with batches of up to 128 activity record pointers, enabling BPF consumers to read kernel timing data directly from CUPTI buffers without per-record probe overhead.
Build/test changes:
- Link test binary against mock CUPTI/CUDA with --no-as-needed so Proton's dlopen(RTLD_NOLOAD) finds the mocks at runtime
- Fix make test to run the test binary directly with LD_LIBRARY_PATH (ctest had no tests registered)
- Add make bpf-test and make test-multi targets for BPF activity parser integration testing
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
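The batching idea can be sketched as follows. This is an illustration under assumptions, not the PR's code: `emitInBatches` and `BatchSink` are hypothetical names, and the sink callback stands in for the real USDT probe site. Only the cap of 128 pointers per batch is taken from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Batch cap stated in the commit message.
constexpr std::size_t kMaxBatch = 128;

// Hypothetical stand-in for the probe: receives a pointer array and its length.
using BatchSink =
    std::function<void(const void *const *records, std::size_t count)>;

// Splits `records` into chunks of at most kMaxBatch pointers and hands each
// chunk to `sink`. Returns the number of sink invocations.
inline std::size_t emitInBatches(const std::vector<const void *> &records,
                                 const BatchSink &sink) {
  assert(sink);
  std::size_t calls = 0;
  for (std::size_t off = 0; off < records.size(); off += kMaxBatch) {
    std::size_t n = std::min(kMaxBatch, records.size() - off);
    sink(records.data() + off, n);
    ++calls;
  }
  return calls;
}
```

With 300 records this fires the sink three times (128, 128, 44) instead of 300 times, which is where the per-record probe overhead is saved.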
… mode
Implement interval-gated probabilistic PC sampling that only serializes kernels during active sampling windows, not for the entire process lifetime.
Architecture:
- CUPTI lifecycle: enable (once) → start/stop (per window) → disable (once)
- Enable START_STOP_CONTROL attribute so start/stop work from CUPTI callbacks
- Collection mode is KERNEL_SERIALIZED for per-kernel correlation
- Probabilistic window: every PARCAGPU_PC_SAMPLING_INTERVAL seconds, roll a PARCAGPU_PC_SAMPLING_PROBABILITY die; if it hits, start sampling until the window closes, then stop and drain data
- start()/stop() are mutex-guarded and idempotent (no double-start/stop races)
- ctxSynchronize before start to satisfy CUPTI's GPU-idle requirement
Key changes:
- pc_sampling.cpp: Session-based enable with per-window start/stop, semaphore-gated stall reason map replay (replaces rate-limited emission), CUPTI 12.4 ABI version check (v22 correlationId boundary), graceful permission failure handling in enable
- cupti.cpp: Probabilistic window state machine in ENTER/EXIT callbacks, env var config (probability, interval), env_config validation
- probes.d: Add error USDT probe for surfacing CUPTI failures to BPF
- test/mock_cupti.c: Full PC sampling mock with real cubin from pc_sample_toy, real SASS offsets for source-line correlation, 11-entry sample table cycling through shmem_bounce/hash_churn/trig_storm kernels
- test/mock_cuda.c: Add cuCtxSynchronize stub
- test/test-pc-mock.sh: New GPU-less test using mock libs and real cubin
- test/test-pc-real.sh: Set probability=1 interval=0.5 for reliable test hits
- test/bpf/: Move CUPTI struct defs to shared cupti_bpf.h, add error event handling, CUDA 12.4+ correlationId support
- test/CMakeLists.txt: Build mock CUDA driver library
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
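The probabilistic window described under Architecture could look roughly like the sketch below. Everything here is an assumption made for illustration: the class `SamplingWindow`, its `onTick` entry point, and the counters are hypothetical, and the real state machine lives in the ENTER/EXIT callbacks in cupti.cpp and calls into CUPTI, which this sketch does not do. It shows only the two properties the commit message names: one die roll per interval, and mutex-guarded, idempotent start/stop.

```cpp
#include <cassert>
#include <mutex>
#include <random>

// Hypothetical sketch of an interval-gated probabilistic sampling window.
class SamplingWindow {
public:
  // interval: window length in seconds; probability: chance in [0, 1] that a
  // given window samples. The seed is injectable so runs can be reproduced.
  SamplingWindow(double interval, double probability,
                 unsigned seed = std::random_device{}())
      : interval_(interval), probability_(probability), rng_(seed) {
    assert(interval > 0.0 && probability >= 0.0 && probability <= 1.0);
  }

  // Called with the current time in seconds (e.g. from a callback).
  // Returns true while a sampling window is active.
  bool onTick(double now) {
    std::lock_guard<std::mutex> lock(mu_);
    if (now >= windowEnd_) {
      stopLocked();                 // close the previous window, if any
      windowEnd_ = now + interval_; // open the next interval
      if (dist_(rng_) < probability_) startLocked();
    }
    return active_;
  }

  int starts() const { std::lock_guard<std::mutex> l(mu_); return starts_; }
  int stops() const { std::lock_guard<std::mutex> l(mu_); return stops_; }

private:
  // Idempotent: a second start while active is a no-op.
  void startLocked() {
    if (active_) return;
    active_ = true;
    ++starts_; // real code would synchronize the context and start sampling
  }
  // Idempotent: a stop while inactive is a no-op.
  void stopLocked() {
    if (!active_) return;
    active_ = false;
    ++stops_; // real code would stop sampling and drain collected data
  }

  mutable std::mutex mu_;
  double interval_;
  double probability_;
  double windowEnd_ = 0.0;
  bool active_ = false;
  int starts_ = 0;
  int stops_ = 0;
  std::mt19937 rng_;
  std::uniform_real_distribution<double> dist_{0.0, 1.0};
};
```

Setting the probability to 1, as test/test-pc-real.sh does, makes every window sample, which is why that configuration gives reliable test hits.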
USDT headers now live in parca-dev/usdt and the CUPTI BPF headers now live in this project, so we don't need to vendor otel anymore.
Contributor
Author
I'm gonna squash first and redo this